DOCUMENT RESUME

ED 247 237                                        TM 840 381

AUTHOR       Doolittle, Allen E.
TITLE        Interpretation of Differential Item Performance
             Accompanied by Gender Differences in Academic
             Background.
PUB DATE     Apr 84
NOTE         15p.; Paper presented at the Annual Meeting of the
             American Educational Research Association (68th, New
             Orleans, LA, April 23-27, 1984).
PUB TYPE     Speeches/Conference Papers (150) -- Reports -
             Research/Technical (143)
EDRS PRICE   MF01/PC01 Plus Postage.
DESCRIPTORS  *College Entrance Examinations; Higher Education;
             *Item Analysis; Mathematics Achievement; Research
             Design; *Sex Differences; *Test Bias; Test Items
IDENTIFIERS  ACT Assessment; *Differential Item Performance

ABSTRACT
     The definition of differential item performance
(DIP), often referred to as item bias, is discussed. DIP is suggested
as a comprehensive term to encompass item bias (item invalidity which
is unfair to certain population subgroups) and instructional bias (a
valid reflection of group differences in instruction or background).
This study investigated the plausibility of an instructional bias
interpretation of DIP as it results from gender differences on
mathematics achievement items. The data from a national
administration of the Mathematics Usage subtest of the ACT Assessment
was used in the investigation. The results indicated that there was
the large instructional effect as predicted. However, there was also
a smaller, gender effect on the performance of some items.
(Author/DWH)



*********************************************************************
*   Reproductions supplied by EDRS are the best that can be made    *
*                   from the original document.                     *
*********************************************************************



U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION
CENTER (ERIC)

This document has been reproduced as
received from the person or organization
originating it.
Minor changes have been made to improve
reproduction quality.

Points of view or opinions stated in this document
do not necessarily represent official NIE
position or policy.






Interpretation of Differential Item Performance
Accompanied by Gender Differences in Academic Background



Allen E. Doolittle 



"PERMISSION TO REPRODUCE THIS
MATERIAL HAS BEEN GRANTED BY 



The American College Testing Program
P.O. Box 168
Iowa City, Iowa 52243



TO THE EDUCATIONAL RESOURCES 
INFORMATION CENTER (ERIC)." 



Paper presented at the annual meeting of the 

American Educational Research Association, New Orleans 

April, 1984




ABSTRACT



The meaning of differential item performance (DIP), often
referred to as item bias, is discussed. DIP is suggested
here to encompass both item bias and instructional bias.
Item bias is described as item invalidity that is unfair to
certain population subgroups. Instructional bias is
described as a valid reflection of group differences in
instruction or background. Using data from a national
administration of the ACT Assessment, this study investigated the
plausibility of an instructional bias interpretation of DIP
as it results from gender differences on mathematics
achievement items. The results indicated that there was the
large instructional effect as predicted, but there was also
a smaller, gender effect on the performance of some items.



INTRODUCTION



During the past decade, there has been a great deal of
work with the construct frequently referred to as "item
bias". Most researchers now conclude that the term "item
bias" is not sufficiently descriptive. Moreover, the common
use of item bias as a synonym for terms such as differential
item performance or item-group interaction is imprecise and
can lead to a misunderstanding about the nature of the
construct. Bias and, hence, item bias are value-laden terms
which imply unfairness. In achievement tests, the construct
can and frequently does exist without unfairness.

The confusion could be reduced by thinking of differential
item performance (DIP) as a comprehensive term. In
this sense, DIP refers to a kind of systematic item effect
that works to the detriment of one group when compared to
another.


Within the scope of this definition, it is possible for
DIP to represent a systematic effect that is basically
unfair, or actually biased, against a group of examinees. In
such an instance, differential item performance would be a
form of item invalidity for that population subgroup and
could be appropriately referred to as item bias. This
situation could exist with an item that measures, in part,
something unrelated to the intended test objective. In general,
this is just poor measurement, but when groups differ in the
knowledge measured by the item, it is also unfair to the
deficient individuals or groups. The inappropriate inclusion
in a test of an item measuring some characteristic not
relevant to the test objectives is unfair or biased against
those without the requisite background.

On the other hand, it is also possible for differential
item performance to reflect group differences in the
achievement of a relevant test objective. Here, DIP would
again represent a systematic effect, but this time the
difference in group performance would be a legitimate
indication of group differences in instruction or preparation.
For instance, if a test is a measure of general chemistry
achievement, organic chemistry items would probably exhibit
"bias" against equally able students with only an inorganic
chemistry background. However, this is not bias in the
sense of item unfairness. It is a valid reflection of
insufficient instruction in organic chemistry. This form of
DIP might be called "instructional bias".






Research has shown that male high school students as a
group perform better than female high school students on
mathematics achievement items (Fennema & Sherman, 1977). A
plausible explanation is that male students typically
receive more and a higher level of instruction in mathematics
than do females. If so, one would expect that instances of
differential item performance, in the form of instructional
bias against females, might exist in mathematics achievement
tests. Similar to the chemistry example cited earlier,
instructional bias might be shown to exist for a complex
algebra item if one group of students has been instructed in
advanced algebra and another group of students, equal in
general ability, has not.

The primary objective of this research was to investigate
the nature of DIP as it relates to gender differences in
mathematics achievement items. Since female students as a
group tend to be less well prepared in mathematics than
males, some instances of the instructional bias type of DIP
were expected within a mathematics achievement test. In
addition, it was believed that if instructional differences
were minimized, little evidence of DIP would be found. A
decided reduction in evidence of differential item
performance from an uncontrolled situation to an instructionally
controlled situation would suggest an instructional bias
interpretation of the DIP found in the first analysis.

A secondary objective of this research was to present a
multiple analysis approach to the study of DIP. It is
unlikely that typical classification variables such as race
and gender, in themselves, are major contributors to DIP.
The multiple analysis approach used here may be useful for
studying the extent to which DIP is a function of one of the
usual classification variables or of some other variable
more directly linked to item performance.






METHODOLOGY

Data Source

The data source for this research was the October, 1981
administration of the ACT Assessment Mathematics Usage test
(ACTM) to a sample of 4,000 college-bound, high school
students. Of these 4,000 students, 1,668 (41.7%) were male and
2,332 (58.3%) were female. As shown in Table 1, the mean
ACTM scaled score for the males (16.1) was about one third
of a standard deviation higher than the mean for the females
(13.7). Also, the males averaged more semesters of
mathematics coursework in a four-year high school career (6.2)
than did females (5.6).



TABLE 1

Subgroup Descriptive Statistics

               Males                  Females
              (N=1668)               (N=2332)

           ACTM   Math Sems       ACTM   Math Sems

Mean       16.1      6.2          13.7      5.6
S.D.        8.1      1.9           7.6      2.0
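
The "about one third of a standard deviation" comparison can be checked from the Table 1 values. The sketch below is an illustration only: it assumes a pooled within-group standard deviation as the yardstick, which the paper does not specify, and the function name `pooled_sd` is hypothetical.

```python
import math

def pooled_sd(n1, sd1, n2, sd2):
    """Pooled within-group standard deviation (an assumed choice of yardstick)."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))

# ACTM values taken from Table 1.
male_n, male_mean, male_sd = 1668, 16.1, 8.1
female_n, female_mean, female_sd = 2332, 13.7, 7.6

sd = pooled_sd(male_n, male_sd, female_n, female_sd)
standardized_difference = (male_mean - female_mean) / sd

print(round(sd, 2), round(standardized_difference, 2))
```

Under this assumption the standardized difference comes out near 0.31, in line with the "about one third" stated in the text. Using either group's own standard deviation instead gives a similar figure.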



The Instrument

The ACT Assessment examination is an educational
achievement test containing four subtests, one of which is
Mathematics Usage (ACTM). The ACTM is a 40-item measure of
mathematical reasoning ability. It emphasizes the solution
of practical quantitative problems that are encountered in
many postsecondary curricula and includes a sampling of
mathematical techniques covered in high school courses. The
test emphasizes quantitative reasoning rather than
memorization of formulas, knowledge of techniques, or computational
skill.



Index of Differential Item Performance

A measure suggested by Linn and Harnisch (1981) was used
as the index of DIP in this research. Although this measure
is based on item response theory, it may be viewed as a
"small sample" alternative to some of the more frequently
studied IRT indices. To calculate the index, the item and
ability parameters of the three-parameter logistic model are
estimated for the total sample. The target group is then
separated from the rest of the sample. (The target group
can be any group of interest, but it is frequently thought
of as a low-scoring group or one that may be adversely
affected by DIP.) The difference is taken between each target
group examinee's computed probability of correctly answering
the item and the examinee's actual response to the item
(1 = correct; 0 = incorrect). The index is this difference,
standardized and averaged over all members of the target
group. This index is considered a signed index. That is,
the direction of the DIP is indicated. As calculated here,
negative values represent DIP against the target group and
positive values represent DIP favoring the target group.
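
The calculation just described can be sketched in a few lines. This is a minimal illustration, not the author's program. It assumes that the item parameters (a, b, c) and the ability estimates have already been obtained from the total sample, that the logistic scaling constant is 1.7, that each difference is standardized by the binomial standard deviation sqrt(P(1 - P)), and that the difference is taken as observed minus predicted so that negative values indicate DIP against the target group. The function names are hypothetical.

```python
import math

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response.
    D = 1.7 is an assumed scaling constant."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def dip_index(thetas, responses, a, b, c):
    """Average standardized residual for one item over the target group.

    thetas    -- ability estimates of the target group examinees
    responses -- their scored responses to the item (1 = correct, 0 = incorrect)
    a, b, c   -- item parameters estimated on the total sample
    """
    residuals = []
    for theta, u in zip(thetas, responses):
        p = p_correct_3pl(theta, a, b, c)
        # Observed minus predicted, scaled by the binomial standard deviation.
        residuals.append((u - p) / math.sqrt(p * (1.0 - p)))
    return sum(residuals) / len(residuals)

# Hypothetical target group that does worse than the model predicts.
example = dip_index([-0.5, 0.0, 0.5, 1.0], [0, 0, 1, 0], a=1.0, b=0.0, c=0.2)
print(example < 0)
```

Because the average is taken only over target group members, regions of the ability scale with few target group examinees contribute little to the index.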

There are a couple of potential advantages of this
procedure over other IRT approaches. The primary advantage is
its applicability to relatively small samples. The usual
3-parameter IRT estimation procedures do not work well for
samples of less than 1,000 (Linn & Harnisch, 1981; Wood &
Lord, 1976). Since it is not uncommon for a subgroup to be
this small or smaller, even when the overall size of the
data set is quite large, the potential value of a
small-sample IRT alternative is clear. Another advantage, according
to Linn and Harnisch, is that the index is weighted by the
actual distribution of examinee ability estimates (thetas)
in the target group. This is an advantage over procedures
based on simple comparisons of item characteristic curves
for specified groups which can suggest differences in
situations where there may be few observations.* Previous
research has shown the Linn and Harnisch measure to be a
stable index and to be substantially correlated with other,
perhaps more common, measures of DIP (Doolittle, 1983).



* There are other IRT indices that also share this
  particular advantage (see Shepard, Camilli, & Williams, 1983).






Instructional Background Indicator



Since a precise measure of instructional background in
mathematics was not available, the members of the sample
were classified on the basis of the number of semesters of
mathematics instruction they received while in high school.
For this research, those who reported at least six semesters
of mathematics (in an eight-semester high school career)
were considered the high background group, and those with
less than six semesters were considered the low background
group. As a result, 69.3% of the males and 57.9% of the
females in the sample were placed in the high background
category.



Research Design

This study consisted of several sub-analyses based on
various divisions of the original sample (Table 2). As the
sample was divided differently, the variation in the numbers
of items with significant values of the index was expected
to provide evidence as to the nature of the DIP. For an
instructional bias interpretation, the instances of
significant DIP were expected to fluctuate as follows (also Table
2, col. 1):

1. When the sample was divided on the basis of gender by
   itself, a moderate number of items was expected to
   show DIP.

2. When the sample was divided solely on the basis of
   instructional background (6-8 math sems. vs. 0-5
   math sems.), a relatively large number of items were
   expected to exhibit DIP.

3. When the sample was divided on the basis of gender,
   but with instructional background held relatively
   constant for each group at either a high (a) or a low
   level (b), few of the items were expected to show
   DIP.

4. When the sample was divided on the basis of levels of
   instruction, but with gender controlled, e.g., for
   males (a) and for females (b), relatively large numbers
   of items were expected to show DIP.

Column 2 of Table 2 suggests a possible outcome of these
analyses if DIP, instead, were to reflect actual "item bias"
against females.
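
The six divisions of the sample can be written out explicitly. The sketch below is illustrative only: the record layout (a dictionary with "gender" and "math_sems" keys) and the function names are assumptions, not part of the original study. The high/low cut at six semesters follows the Instructional Background Indicator section, and the second group of each pair is treated as the target group, as noted under Table 3.

```python
def background(math_sems):
    """'H' for at least six semesters of high school mathematics, else 'L'."""
    return "H" if math_sems >= 6 else "L"

def build_analyses(sample):
    """Return {analysis label: (comparison group, target group)}."""
    def pick(gender=None, level=None):
        return [e for e in sample
                if (gender is None or e["gender"] == gender)
                and (level is None or background(e["math_sems"]) == level)]
    return {
        "1":  (pick(gender="M"), pick(gender="F")),  # male / female
        "2":  (pick(level="H"), pick(level="L")),    # high / low background
        "3a": (pick("M", "H"), pick("F", "H")),      # gender, high background
        "3b": (pick("M", "L"), pick("F", "L")),      # gender, low background
        "4a": (pick("M", "H"), pick("M", "L")),      # background, males only
        "4b": (pick("F", "H"), pick("F", "L")),      # background, females only
    }

# Hypothetical four-examinee sample, for illustration only.
sample = [
    {"gender": "M", "math_sems": 8},
    {"gender": "M", "math_sems": 4},
    {"gender": "F", "math_sems": 6},
    {"gender": "F", "math_sems": 2},
]
analyses = build_analyses(sample)
print({label: (len(ref), len(tgt)) for label, (ref, tgt) in analyses.items()})
```

Each pair of groups would then be passed to the index calculation separately, so that the count of significant items can be compared across the six analyses.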

Since the exact distribution of the Linn and Harnisch
index is not known under the assumption of no DIP, an
approximation to the distribution was calculated for each analysis.
This procedure, suggested by Linn, Levine, Hastings, and
Wardrop (1981), involved dividing a particular sample into
essentially random halves and calculating the index on one
of the halves as a pseudo target group. This was expected
to represent a distribution of index values for the null
hypothesis situation. The highest absolute value in this
distribution was taken as the critical value. Since the
various analyses in this study involved different subsamples,
the approximate null hypothesis distribution was uniquely
determined for each analysis.
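
The random-halves procedure can be sketched as follows. This is an illustration under stated assumptions rather than the procedure's actual implementation: it assumes each examinee's standardized residual on each item has already been computed, uses Python's `random` module with a fixed seed to form the halves, and flags an item when the absolute value of its index is at least the critical value (whether the original comparison was strict is not stated). The function names are hypothetical.

```python
import random

def null_critical_value(residuals, seed=0):
    """Largest absolute index value when a random half is the pseudo target group.

    residuals[i][j] is the standardized residual of examinee i on item j.
    """
    rng = random.Random(seed)
    order = list(range(len(residuals)))
    rng.shuffle(order)
    pseudo_target = order[: len(order) // 2]
    n_items = len(residuals[0])
    index_values = [
        sum(residuals[i][j] for i in pseudo_target) / len(pseudo_target)
        for j in range(n_items)
    ]
    return max(abs(v) for v in index_values)

def flag_items(index_values, critical):
    """Positions of items whose index is at least as extreme as the critical value."""
    return [j for j, v in enumerate(index_values) if abs(v) >= critical]

# Hypothetical index values for three items against a critical value of .06.
print(flag_items([-0.07, 0.03, 0.06], 0.06))
```

Because the critical value is the single largest absolute value from one random split, it depends on the particular split drawn, which is one reason it was determined separately for each analysis.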



TABLE 2

Hypothesized Numbers of Items Exhibiting DIP

                                        Instruct. bias   Item bias
Groups                                  interpretation   interpretation

1. Male/female                          moderate         moderate
                                        (baseline)       (baseline)

2. Strong background/weak background    rel. high        low

3. a. Male, strong background/
      female, strong background         low              moderate
   b. Male, weak background/
      female, weak background           low              moderate

4. a. Strong background, male/
      weak background, male             rel. high        low
   b. Strong background, female/
      weak background, female           rel. high        low






RESULTS



The results shown in Table 3 were not entirely as
expected. Three items were identified as biased in the
male-female analysis (Analysis 1) and seven were so identified in
the high instruction-low instruction analysis (2). In terms
of numbers of biased items, these results were consistent
with expectations. Similarly, the two items identified for
the male low-female low analysis (3b), the seven items
identified for the male high-male low analysis (4a), and the
nine items identified for the female high-female low
analysis (4b) were in line with expectations. Thus, five of the
six analyses produced numbers of biased items basically as
hypothesized. However, the male, high instruction-female,
high instruction analysis (3a) produced seven biased items,
substantially more items than was expected.



A further surprise was that the gender-oriented analyses
(1, 3a, 3b) and the level-oriented analyses (2, 4a, 4b) did not
yield similar results.** If gender were acting simply as a
correlate for instruction, the signs for identified items
should be the same across all analyses. However, without
exception, items with significantly negative values of the
index, for any of the gender-oriented analyses, had positive
signs if also identified by any of the instruction-oriented
analyses, and vice versa.

When the items with significant DIP were examined, some
interesting patterns emerged. Relatively abstract items
(strictly numbers and symbols), such as items 6 and 10,
tended to favor the high instruction groups, and more
concrete, arithmetic reasoning items (word problems), such as
items 12, 25, 27 and 28, tended to favor the low instruction
groups. This relationship seems intuitively plausible since
the most instruction in abstract mathematics is likely to be
received by the more advanced students, those with
sufficient prerequisite coursework. Thus, it would seem likely
that high instruction students would do relatively well on
these items. Conversely, it would seem to follow that low
instruction examinees would perform relatively better on the
more concrete mathematics items.

** The six separate analyses may be logically grouped based
   on their focus. The gender-oriented analyses are those
   that compare the item performance of males and females
   regardless of whether or not level of instruction has been
   controlled. Similarly, the instructional level analyses
   are those that compare the item performance of high and
   low instructional level examinees regardless of whether or
   not gender has been held constant.

Although the indications of DIP were less strong for the
gender-oriented analyses than for the instructional level
analyses, again there seemed to be some notable patterns.
Geometry items (items 3 and 11) and, to some extent,
arithmetic reasoning problems (items 12 and 25) seemed to
adversely affect the performance of females. Why these types
of items appeared relatively more difficult for females,
however, is not readily apparent.






TABLE 3

Significant DIP in ACTM Items*

                              Analysis

          1         2         3a          3b          4a          4b
Item   M/F(.06)  H/L(.06)  MH/FH(.06)  ML/FL(.06)  MH/ML(.06)  FH/FL(.05)

3      -.07  -.10
[?]     .06
6      -.1[?]  -.22  -.16
7      -.10   .07  -.10  -.11
10     -.15   .08  -.14  -.15
11     -.07  -.08  -.07
12     -.06   .07  -.07  -.10   .16
13      .07  -.05
1[?]    .06
16     -.07  -.12
25     -.07   .06
27      .08   .15
28      .09   .10   .09
37      .07

# Sig.:      3     7     7     2     7     9

Predicted:   mod   high  low   low   high  high

* Significance determined by comparison to the largest absolute value
  of the statistic calculated in a random, null hypothesis
  situation. The obtained critical value of the index
  is shown in parenthesis for each analysis.

The analysis headings are:

   M  - male                   F  - female
   H  - high instruction       L  - low instruction
   MH - male, high instruct.   FH - female, high instruct.
   ML - male, low instruct.    FL - female, low instruct.

The second group under each heading is the target group. A negative
value represents DIP to the disadvantage of the target group
while a positive value is DIP to the relative benefit of the
target group.



DISCUSSION



Although the primary intention of this research was to
clarify the nature of gender-related DIP in mathematics
achievement items, the results were not quite as expected.
Evidence of DIP was found in the analyses focused on
instructional level and, to a slightly lesser extent, in the
gender-oriented analyses. However, the fact that the
direction of the DIP was not consistent for the same items across
analyses suggests that, in this situation, gender was more
than a simple correlate of instructional level. The notion
that gender-related DIP among mathematics achievement items
is merely due to gender differences in instructional level
was not supported.

The evident instances of DIP, seemingly due to gender and
not instructional level, suggest the possibility of at least
two alternative hypotheses. Perhaps the measure of
instructional level used in the study was inadequate. That is,
quantity or number of high school mathematics courses may
not have been an appropriate measure. If there is a
substantial group to group discrepancy in the type or quality
of instruction received, a measure based on quantity would
indeed seem to be inadequate. Possibly this could have been
enough of a problem to substantially impact these results.

Another possible explanation is that there may be items
that do, in fact, perform differently for males and females,
regardless of instructional level. This may be akin to true
item bias but, more likely, it suggests the existence of
some other background variable, like examinee expectations,
acculturation, or motivation, that could differentially
affect group performance.

The fact that there seemed to be certain groups of items
that favored one group over another indicates that future
research emphasizing item type or item content might be
fruitful. A possible avenue for this kind of research is
the experimental design approach suggested by Schmeiser
(1983). When there are indications that certain types of
items may be biased against one group or another, the use of
relevant analysis of variance procedures could provide a
practical means for examining the differential impact of,
for example, geometry items on gender groups. From an
exploratory research perspective, it might also be helpful to
simply ask female examinees why they had trouble with
certain geometry and arithmetic reasoning items.



The secondary objective of this study was to present a
multiple analysis approach for the investigation of DIP.
Such an approach seems particularly useful for learning more
about the nature of DIP. Many "item bias" studies have
stopped at the level of the male-female analysis in this
study. That is, if any evidence of DIP is found, there is a
tendency to conclude that particular items are biased
against one group or another. It is argued here that such
an interpretation may be premature. Although the results of
this research did not clearly demonstrate the potential
problems with the typical item bias study, the suggested
methodology should provide a means to explore DIP at a
deeper level. The multiple analysis approach can be used to
investigate likely causal variables, such as instruction, in
addition to the more typical group classification variables,
such as race or gender.






REFERENCES



Doolittle, A. E. The reliability of measuring differential
   item performance. ERIC #ED 234061. Paper presented at
   the annual meeting of the American Educational Research
   Association, Montreal, April, 1983.

Fennema, E. L., and Sherman, J. Sex-related differences in
   mathematics achievement, spatial visualization and
   affective factors. American Educational Research
   Journal, 1977, 14, 51-71.

Linn, R. L., Levine, M. V., Hastings, C. N., and Wardrop, J.
   L. Item bias in a test of reading comprehension.
   Applied Psychological Measurement, 1981, 5(2), 159-173.

Linn, R. L., and Harnisch, D. L. Interactions between item
   content and group membership on achievement test items.
   Journal of Educational Measurement, 1981, 18(2), 109-118.

Schmeiser, C. Differences between black and white
   examinee performance on the ACT Assessment examination as
   a function of the racial orientation of test content.
   Doctoral dissertation. The University of Iowa, Iowa City,
   1983.

Shepard, L., Camilli, G., and Williams, D. M. Accounting
   for statistical artifacts in item bias research. Paper
   presented at the annual meeting of the American
   Educational Research Association, Montreal, April, 1983.

Wood, R. L., and Lord, F. M. A User's Guide to LOGIST.
   Research Memorandum. Princeton, N.J.: Educational
   Testing Service, 1976.






